Abstract

We present morphogen, a tool for improving translation into morphologically rich languages
with synthetic phrases. We approach the problem of translating into morphologically rich
languages in two phases. First, an inflection model is learned to predict target word inflections
from source-side context. Then this model is used to create additional sentence-specific
translation phrases. These "synthetic phrases" augment the standard translation grammars, and
decoding proceeds normally with a standard translation model. We present an open-source
Python implementation of our method, as well as a method of obtaining an unsupervised
morphological analysis of the target language when no supervised analyzer is available.


1.  Introduction 

Machine translation into morphologically rich languages is challenging because grammatical
features are expressed through morphology, which leads to lexical sparsity.
In this paper, we present an open-source Python tool, morphogen, that leverages target
language morphological grammars (either hand-crafted or learned without supervision)
to enable prediction of highly inflected word forms from rich, source language
syntactic information.^1

Unlike previous approaches to translation into morphologically rich languages,
our tool constructs sentence-specific translation grammars (i.e., phrase tables) for each
sentence that is to be translated, but then uses a standard decoder to generate the final
translation with no post-processing. The advantages of our approach are: (i) newly
synthesized forms are highly targeted to a specific translation context; (ii) multiple
alternatives can be generated with the final choice among rules left to a standard
sentence-level translation model; (iii) our technique requires virtually no
language-specific engineering; and (iv) we can generate forms that were not observed in the
bilingual training data.

1 https://github.com/eschling/morphogen

This paper is structured as follows. We first describe our "translate-and-inflect"
model, which is used to synthesize the target side of a lexical translation rule given its
source and its source context (§2). This model discriminates between inflectional
options for predicted stems, and the set of inflectional possibilities is determined by a
morphological grammar. To obtain this morphological grammar, the user may either
provide a morphologically analyzed version of their target language training data, or
a simple unsupervised morphology learner can be used instead (§3). With the
morphologically analyzed parallel data, the parameters of the discriminative model are
trained from the complete parallel training data using an efficient optimization
procedure that does not require a decoder.

At test time, our tool creates synthetic phrases representing likely inflections of
likely stem translations for each sentence (§4). We briefly present the results of our
system on English-Russian, -Hebrew, and -Swahili translation tasks (§5), and then
describe our open-source implementation and discuss how to use it with both
user-provided morphological analyses and those of our unsupervised morphological
analyzer^2 (§6).

2.  Translate-and-inflect  Model 

The task of the translate-and-inflect model is illustrated in Figure 1 for an
English-Russian sentence pair. The input is a sentence e in the source language^3
together with any available linguistic analysis of e (e.g., its dependency parse). The
output f consists of (i) a sequence of stems, each denoted $\sigma$, and (ii) one
morphological inflection pattern for each stem, denoted $\mu$.^4 Throughout, we use
$\Omega_\sigma$ to denote the set of possible morphological inflection patterns for a
given stem $\sigma$. $\Omega_\sigma$ might be


2 Further documentation is available in the morphogen repository.

3 In this paper, the source language is always English. We use e to denote the source language (rather
than the target language), to emphasize the fact that we are translating from a morphologically
impoverished language to a morphologically rich one.

4 When the information is available from the morphological analyzer, a stem $\sigma$ is represented as a
tuple of a lemma and its inflectional class.
[Figure 1 graphic: predicted target analysis, sentence pair, and source-side annotations]

Predicted analysis:  $\sigma$: пытаться_V + $\mu$: mis-sfm-e
Target sentence:     она пыталась пересечь пути на ее велосипеде
Source sentence:     she had attempted to cross the road on her bike
Word clusters:       C50 C473 C28 C8 C275 C37 C43 C82 C94 C331
POS tags:            PRP VBD VBN TO VB DT NN IN PRP$ NN
Dependency labels:   nsubj, root, xcomp

Figure 1. The inflection model predicts a form for the target verb stem based on its
source attempted and the linear and syntactic source context. The inflection pattern
mis-sfm-e (main+indicative+past+singular+feminine+medial+perfective) is that of
a supervised analyzer.


defined by a grammar; our models restrict $\Omega_\sigma$ to be the set of inflections observed
anywhere in our monolingual or bilingual training data as a realization of $\sigma$.^5

We define a probabilistic model over target words f. The model assumes
independence between each target word f conditioned on the source sentence e and its
aligned position i in this sentence.^6 This assumption is further relaxed in §4 when
the model is integrated in the translation system. The probability of generating each
target word f is decomposed as follows:

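$$p(f \mid e, i) = \sum_{\sigma \star \mu = f} \underbrace{p(\sigma \mid e_i)}_{\text{gen. stem}} \times \underbrace{p(\mu \mid \sigma, e, i)}_{\text{gen. inflection}}.$$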

Here, each stem is generated independently from a single aligned source word $e_i$, but
in practice we use a standard phrase-based model to generate sequences of stems and
only the inflection model operates word-by-word.

2.1.  Modeling  Inflection 

In morphologically rich languages, each stem may be combined with one or more
inflectional morphemes to express different grammatical features (e.g., case,
definiteness, etc.). Since the inflectional morphology of a word generally expresses multiple
features, we use a model that uses overlapping features in its representation of both
5 This is a practical decision that prevents the model from generating words that would be difficult for a
closed-vocabulary language model to reliably score. When open-vocabulary language models are available,
this restriction can easily be relaxed.

6 This is the same assumption that Brown et al. (1993) make in, for example, IBM Model 1.
Word positions considered                                              Extracted for each position
source aligned word $e_i$                                              token
parent word $e_{\pi_i}$ with its dependency $\pi_i \to i$              part-of-speech tag
all children $e_j \mid \pi_j = i$ with their dependency $i \to j$      word cluster
source words $e_{i-1}$ and $e_{i+1}$

Additional features:
- are $e_i$, $e_{\pi_i}$ at the root of the dependency tree?
- number of children, siblings of $e_i$

Table 1. Source features $\varphi(e, i)$ extracted from e and its linguistic analysis. $\pi_i$
denotes the parent of the token in position i in the dependency tree and $\pi_i \to i$ the
typed dependency link.


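the input (i.e., conditioning context) and output (i.e., the inflection pattern):

$$p(\mu \mid \sigma, e, i) = \frac{\exp\left[\varphi(e, i)^\top W \psi(\mu) + \psi(\mu)^\top V \psi(\mu)\right]}{\sum_{\mu' \in \Omega_\sigma} \exp\left[\varphi(e, i)^\top W \psi(\mu') + \psi(\mu')^\top V \psi(\mu')\right]}. \tag{1}$$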


Here, $\varphi$ is an m-dimensional source context feature vector function, $\psi$ is an
n-dimensional morphology feature vector function, W is an $m \times n$ parameter matrix, and V is
an $n \times n$ parameter matrix. In our implementation, $\varphi$ and $\psi$ return sparse vectors of
binary indicator features, but other features can easily be incorporated.
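
To make Eq. 1 concrete, the following is a minimal Python sketch of how the inflection distribution could be computed when $\varphi$ and $\psi$ are sparse binary vectors. It is not morphogen's actual code: the function names, the representation of W and V as dictionaries keyed by feature pairs, and all feature names and weights in the toy example are assumptions made for illustration.

```python
import math

def score(phi, psi, W, V):
    """Unnormalized log-score phi^T W psi + psi^T V psi for sparse binary vectors.

    phi and psi are sets of active feature names; W and V are dicts that map
    feature pairs to weights (absent pairs have weight 0).
    """
    s = sum(W.get((a, b), 0.0) for a in phi for b in psi)
    s += sum(V.get((b, c), 0.0) for b in psi for c in psi)
    return s

def inflection_probs(phi, candidates, W, V):
    """Return {inflection: probability} over the candidate set Omega_sigma.

    candidates maps each candidate inflection to its psi feature set.
    """
    scores = {mu: score(phi, psi, W, V) for mu, psi in candidates.items()}
    m = max(scores.values())  # subtract the max for numerical stability
    exps = {mu: math.exp(s - m) for mu, s in scores.items()}
    z = sum(exps.values())
    return {mu: v / z for mu, v in exps.items()}

# Toy example: every feature name and weight below is made up.
phi = {"child(nsubj)=she", "source_pos=VBN"}
candidates = {
    "past+fem": {"Tense=past", "Gender=fem"},
    "past+masc": {"Tense=past", "Gender=masc"},
}
W = {("child(nsubj)=she", "Gender=fem"): 2.0,
     ("source_pos=VBN", "Tense=past"): 1.0}
V = {("Tense=past", "Gender=fem"): 0.5}
probs = inflection_probs(phi, candidates, W, V)
```

Because the normalization runs only over the inflections in $\Omega_\sigma$, the cost of one prediction grows with the number of candidate inflections of the stem rather than with the size of the target vocabulary.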

2.2. Source Contextual Features: $\varphi(e, i)$

In order to select the best inflection of a target-language word, given the source
word it translates from and the context of that source word, we seek to leverage
numerous features of the context to capture the diversity of possible grammatical relations
that might be encoded in the target language morphology. Consider the example
shown in Figure 1, where most of the inflection features of the Russian word (past
tense, singular number, and feminine gender) can be inferred from the context of the
source word it is aligned to. To access this information, our tool uses parsers and other
linguistic analyzers.

By  default,  we  assume  that  English  is  the  source  language  and  provide  wrappers 
for  external  tools  to  generate  the  following  linguistic  analyses  of  each  input  sentence: 

•  Part-of-speech tagging with a CRF tagger trained on sections 02-21 of the Penn
Treebank,

•  Dependency parsing with TurboParser (Martins et al., 2010), and

•  Mapping of the tokens to one of 600 Brown clusters trained from 8B words of
English text.^7


7 The entire monolingual data available for the translation task of the 8th ACL Workshop on
Statistical Machine Translation was used. These clusters are available at
http://www.ark.cs.cmu.edu/cdyer/en-c600.gz
From these analyses we then extract features from e by considering the aligned source
word $e_i$, its preceding and following words, and its dependency neighbors. These
are detailed in Table 1 and can be easily modified to include different features or for
different source languages.
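
As an illustration of the feature templates in Table 1, the sketch below extracts indicator features for one source position from a tagged, clustered, and parsed sentence. It is only a sketch under stated assumptions: the function name, the feature naming scheme, and the input representation (parallel lists plus a head index per token) are hypothetical, and the dependency heads and labels in the example are filled in by hand, since Figure 1 shows only some of the arcs.

```python
def source_features(i, tokens, tags, clusters, heads, labels):
    """Return the set of binary indicator features for source position i.

    heads[j] is the index of the parent of token j (-1 for the root) and
    labels[j] is the typed dependency between token j and its parent.
    """
    def describe(prefix, j):
        # Token, part-of-speech tag, and word cluster of position j.
        return {f"{prefix}_token={tokens[j]}",
                f"{prefix}_pos={tags[j]}",
                f"{prefix}_cluster={clusters[j]}"}

    feats = describe("source", i)
    if i > 0:
        feats |= describe("before", i - 1)
    if i + 1 < len(tokens):
        feats |= describe("after", i + 1)

    parent = heads[i]
    if parent >= 0:
        feats |= describe(f"parent({labels[i]})", parent)
    children = [j for j, h in enumerate(heads) if h == i]
    for j in children:
        feats |= describe(f"child({labels[j]})", j)

    siblings = [j for j, h in enumerate(heads)
                if parent >= 0 and h == parent and j != i]
    feats.add(f"is_root={parent < 0}")
    feats.add(f"parent_is_root={parent >= 0 and heads[parent] < 0}")
    feats.add(f"n_children={len(children)}")
    feats.add(f"n_siblings={len(siblings)}")
    return feats

# Sentence from Figure 1; heads and labels are an illustrative guess at the parse.
tokens = "she had attempted to cross the road on her bike".split()
tags = "PRP VBD VBN TO VB DT NN IN PRP$ NN".split()
clusters = "C50 C473 C28 C8 C275 C37 C43 C82 C94 C331".split()
heads = [2, 2, -1, 4, 2, 6, 4, 4, 9, 7]
labels = ["nsubj", "aux", "root", "aux", "xcomp",
          "det", "dobj", "prep", "poss", "pobj"]
feats = source_features(2, tokens, tags, clusters, heads, labels)
```

For the aligned word attempted, the resulting set contains, among others, `child(nsubj)_token=she`, which is the kind of evidence the model can use to prefer a feminine singular verb form.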

3.  Morphological  Grammars  and  Features 

The discriminative model in the previous section selects an inflectional pattern
for each candidate stem. In this section, we discuss where the inventory of possible
inflectional patterns it considers comes from.

3.1.  Supervised  Morphology 

If a target language morphological analyzer is available that analyzes each word
in the target of the bitext and monolingual training data into a stem and vector of
grammatical features, the inflectional vector may be used directly to define $\psi(\mu)$ by
defining a binary feature for each key-value pair (e.g., Tense=past) composing the tag.
Prior to running morphogen, the full monolingual and target side bilingual training
data should be analyzed.
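
The mapping from an analysis to binary key-value features can be sketched as follows for a positional tag such as mis-sfm-e from Figure 1. This is a hedged illustration, not the configuration shipped with morphogen: the function name and the attribute names in `VERB_SLOTS` are hypothetical and only follow the gloss given in the caption of Figure 1.

```python
# Hypothetical attribute names for the positions of a verb tag; consult the
# documentation of the actual tagset for the real inventory.
VERB_SLOTS = ["Type", "Mood", "Tense", "Person", "Number", "Gender",
              "Voice", "Definiteness", "Aspect"]

def tag_to_features(tag, slots=VERB_SLOTS):
    """Map a positional tag such as 'mis-sfm-e' to binary key=value features.

    Positions filled with '-' carry no value and produce no feature.
    """
    return {f"{name}={value}" for name, value in zip(slots, tag) if value != "-"}

features = tag_to_features("mis-sfm-e")
```

Each element of the returned set corresponds to one active dimension of $\psi(\mu)$, so two inflections that share, say, tense and number also share the corresponding features.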

3.2.  Unsupervised  Morphology 

Supervised morphological analyzers that map between inflected word forms and
abstract grammatical feature representations (e.g., past+sing) are not available for
every language into which we might seek to translate. We therefore provide an
unsupervised model of morphology that segments words into sequences of morphemes,
assuming a concatenative generation process and a single analysis per type. To do so,
we assume that each word can be decomposed into any number of prefixes, a stem,
and any number of suffixes. Formally, we let M represent the set of all possible
morphemes and define a regular grammar M*MM* (i.e., zero or more prefixes, a stem,
and zero or more suffixes). We learn weights for this grammar by assuming that the
probability of each prefix, stem, and suffix is given by a draw from a Dirichlet
distribution over all morphemes and then inferring the most likely analysis.
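
To give a feel for the search space defined by the grammar M*MM*, the sketch below enumerates every analysis of a word into prefixes, a stem, and suffixes. It only illustrates the space of candidate analyses; it does not implement the Dirichlet-based model or its inference, and the function names are assumptions.

```python
def compositions(word):
    """Yield every way of cutting word into one or more non-empty pieces."""
    if not word:
        return
    yield [word]
    for cut in range(1, len(word)):
        for rest in compositions(word[cut:]):
            yield [word[:cut]] + rest

def analyses(word):
    """Yield (prefixes, stem, suffixes) for every analysis licensed by M*MM*."""
    for pieces in compositions(word):
        for k in range(len(pieces)):  # choose which piece is the stem
            yield tuple(pieces[:k]), pieces[k], tuple(pieces[k + 1:])

all_analyses = list(analyses("abc"))
```

Even the three-letter string "abc" has eight analyses, and the number grows exponentially with word length, which is why the learned weights are needed to single out one analysis per word type.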

Hyperparameters. To run the unsupervised analyzer, it is necessary to specify the
Dirichlet hyperparameters ($\alpha_p$, $\alpha_s$, $\alpha_t$), which control the sparsity of the inferred
prefix, suffix, and stem lexicons, respectively. The learned morphological grammar is
(rather unfortunately) very sensitive to these settings, and some exploration is
necessary. As a rule of thumb, we observe that $\alpha_p, \alpha_s < \alpha_t < 1$ is necessary to recover
useful segmentations, as this encodes that there are many more possible stems than
inflectional affixes; however, the absolute magnitude will depend on a variety of
factors. Default values are $\alpha_p = \alpha_s = 10^{-6}$, $\alpha_t = 10^{[\ldots]}$; these may be adjusted by factors
of 10 (smaller values increase sparsity; larger values decrease it).
Unsupervised morphology features: $\psi(\mu)$. For the unsupervised analyzer, we do
not have a mapping from morphemes to grammatical features (e.g., +past);
however, we can create features from the affix sequences obtained after morphological
segmentation. We produce binary features corresponding to the content of each
potential affixation position relative to the stem. For example, the unsupervised analysis
wa+ki+wa+STEM of the Swahili word wakiwapiga will produce the following features:

Prefix[-3][wa]  Prefix[-2][ki]  Prefix[-1][wa].
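
The positional affix features can be derived mechanically from a segmentation string, as in the following sketch. The function name is an assumption, and the `Suffix[k][...]` naming for suffix positions is a hypothetical extension by analogy, since the text only shows prefix features.

```python
def affix_features(segmentation, stem_marker="STEM"):
    """Return positional affix features for a segmentation like 'wa+ki+wa+STEM'."""
    parts = segmentation.split("+")
    k = parts.index(stem_marker)
    prefixes, suffixes = parts[:k], parts[k + 1:]
    # Prefix positions are counted leftwards from the stem: -1 is adjacent to it.
    feats = [f"Prefix[{j - len(prefixes)}][{m}]" for j, m in enumerate(prefixes)]
    # Suffix positions are counted rightwards from the stem (assumed naming).
    feats += [f"Suffix[{j + 1}][{m}]" for j, m in enumerate(suffixes)]
    return feats

swahili = affix_features("wa+ki+wa+STEM")
```

Indexing positions relative to the stem means that the same morpheme in the same slot yields the same feature regardless of how many other affixes the word carries.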

3.3.  Inflection  Model  Parameter  Estimation 

From the analyzed parallel corpus (source side syntax and target side
morphological analysis), morphogen sets the parameters W and V of the inflection
prediction model (Eq. 1) using stochastic gradient descent to maximize the conditional
log-likelihood of a training set consisting of pairs of source sentence contextual features
($\varphi$) and target word inflectional features ($\psi$). The training instances are word
alignment pairs from the full training corpus. When morphological category information
is available, an independent model may be trained for each open-class category (e.g.,
nouns, verbs); by default, however, a single model is used for all words (excluding words
shorter than a minimum length).
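
The training procedure can be sketched as plain SGD on the conditional log-likelihood of Eq. 1, whose gradient is the difference between observed and expected feature counts. The code below is an illustrative pure-Python sketch, not the Cython optimizer used by morphogen; the function names, the data layout, and the toy features are assumptions, and the 10 epochs and rate of 0.01 are borrowed from footnote 8 only as example settings.

```python
import math
from collections import defaultdict

def score(phi, psi, W, V):
    """phi^T W psi + psi^T V psi for sparse binary feature sets."""
    return (sum(W.get((a, b), 0.0) for a in phi for b in psi)
            + sum(V.get((b, c), 0.0) for b in psi for c in psi))

def probs(phi, cands, W, V):
    """Softmax over the candidate inflections (a list of psi feature sets)."""
    scores = [score(phi, psi, W, V) for psi in cands]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [x / z for x in exps]

def sgd_step(phi, cands, gold, W, V, rate=0.01):
    """One SGD update; returns the log-likelihood of the gold inflection before it."""
    p = probs(phi, cands, W, V)
    for k, psi in enumerate(cands):
        # Observed minus expected count for every feature pair of candidate k.
        g = (1.0 if k == gold else 0.0) - p[k]
        for b in psi:
            for a in phi:
                W[a, b] += rate * g
            for c in psi:
                V[b, c] += rate * g
    return math.log(p[gold])

# Toy training set: (source features, candidate inflections, index of the gold one).
data = [
    ({"child(nsubj)=she"}, [{"Gender=fem"}, {"Gender=masc"}], 0),
    ({"child(nsubj)=he"}, [{"Gender=fem"}, {"Gender=masc"}], 1),
]
W, V = defaultdict(float), defaultdict(float)
history = []
for epoch in range(10):
    history.append(sum(sgd_step(phi, c, g, W, V) for phi, c, g in data))
```

Each update touches only the feature pairs that are active for the current instance and its candidate inflections, which is what makes training on every word alignment pair of the full corpus practical.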

It is important to note here that our richly parameterized model is trained on the
full parallel training corpus, not just on the small number of development sentences.
This is feasible because, in contrast to standard discriminative translation models
which seek to discriminate good complete translations from bad complete
translations, morphogen's model must only predict how good each possible inflection of an
independently generated stem is. All experiments reported in this paper used models
trained on a single processor using a Cython implementation of the SGD optimizer.^8

4.  Synthetic  Phrases 

How is morphogen used to improve translation? Rather than using the
translate-and-inflect model directly to perform translation, we use it just to augment the set of
rules available to a conventional hierarchical phrase-based translation model (Chiang,
2007; Dyer et al., 2010). We refer to the phrases it produces as synthetic phrases. The
aggregate grammar consists of both synthetic and "default" phrases and is used by
an unmodified decoder.

The  process  works  as  follows.  We  use  the  suffix-array  grammar  extractor  of  Lopez 
(2007)  to  generate  sentence-specific  grammars  from  the  fully  inflected  version  of  the 
training  data  (the  default  grammar)  and  also  from  the  stemmed  variant  of  the  training 


8 For our largest model, trained on 3.3M Russian words, n = 231K × m = 336 features were produced,
and 10 SGD iterations at a rate of 0.01 were performed in less than 16 hours.
Russian (supervised)
  Verb: 1st Person                              child(nsubj)=I    child(nsubj)=we
  Verb: Future tense                            child(aux)=MD     child(aux)=will
  Noun: Animate                                 source=animals/victims/...
  Noun: Feminine gender                         source=obama/economy/...
  Noun: Dative case                             parent(iobj)
  Adjective: Genitive case                      grandparent(poss)

Hebrew
  Suffix ים (masculine plural)                  parent=NNS        after=NNS
  Prefix א (first person sing. + future)        child(nsubj)=I    child(aux)='ll
  Prefix כ (preposition like/as)                child(prep)=IN    parent=as
  Suffix י (possessive mark)                    before=my         child(poss)=my
  Suffix ה (feminine mark)                      child(nsubj)=she  before=she
  Prefix כש (when)                              before=when       before=WRB

Swahili
  Prefix li (past)                              source=VBD        source=VBN
  Prefix nita (1st person sing. + future)       child(aux)        child(nsubj)=I
  Prefix ana (3rd person sing. + present)       source=VBZ
  Prefix wa (3rd person plural)                 before=they       child(nsubj)=NNS
  Suffix tu (1st person plural)                 child(nsubj)=she  before=she
  Prefix ha (negative tense)                    source=no         after=not

Figure 2. Examples of highly weighted features learned by the inflection model. We selected a
few frequent morphological features and show their top corresponding source context features.


data (the stemmed grammar). We then extract a set of translation rules that only
contain terminal symbols (sometimes called "lexical rules") from the stemmed grammar.
The (stemmed) target side of each such phrase is then re-inflected using the inflection
model described above (§2), conditioned on the source sentence and its context. Each
stem is given its most likely inflection. The resulting rules are added to the default
grammar for the sentence to produce the aggregate grammar.

The standard translation rule features present on the stemmed grammar rules are
preserved, and morphogen adds the following features to help the decoder select good
synthetic phrases: (i) a binary feature indicating that the phrase is synthetic; (ii) the log
probability of the inflected form according to the inflection model; and (iii) if available,
counts of the morphological categories inflected.
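
The three added features can be illustrated with a small sketch that writes one synthetic rule as a grammar line with fields separated by triple pipes. This is a sketch under stated assumptions rather than morphogen's output format: the function name and the feature names `BaseFeature`, `Synthetic`, `InflectionLogProb`, and `Inflected_V` are hypothetical, and the rule layout merely follows the general shape of per-sentence grammar files used with cdec.

```python
import math

def synthetic_rule(source, inflected_target, base_features,
                   inflection_prob, categories=()):
    """Format one synthetic rule as a '|||'-separated grammar line."""
    features = dict(base_features)  # keep the stemmed rule's standard features
    features["Synthetic"] = 1.0  # (i) marks the rule as synthetic
    features["InflectionLogProb"] = math.log(inflection_prob)  # (ii)
    for category in categories:
        # (iii) count how many words of each morphological category were inflected
        name = f"Inflected_{category}"
        features[name] = features.get(name, 0.0) + 1.0
    feature_str = " ".join(f"{k}={v:g}" for k, v in features.items())
    return f"[X] ||| {source} ||| {inflected_target} ||| {feature_str}"

rule = synthetic_rule("attempted", "пыталась", {"BaseFeature": 0.5}, 0.25, ["V"])
```

Exposing the synthetic indicator and the inflection log probability as separate features lets the tuning procedure learn how much to trust synthetic rules relative to rules extracted directly from the training data.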

5.  Experiments 

We briefly report in this section on some experimental results obtained with our
tool. We ran experiments on a 150k sentence English-Russian task (WMT2013;
news-commentary), a 134k sentence English-Hebrew task (WIT3 TED talks corpus), and a
15k sentence English-Swahili task. Space precludes a full discussion of the
performance of the classifier,^9 but we can also inspect the weights learned by the model to
assess the effectiveness of the features in relating source-context structure with
target-side morphology. Such an analysis is presented in Figure 2.


9 We present our approach and the results of both the intrinsic and extrinsic evaluations in much more
depth in Chahuneau et al. (in review).
                      EN→RU       EN→HE       EN→SW
Baseline              14.7±0.1    15.8±0.3    18.3±0.1
+Class LM             15.7±0.1    16.8±0.4    18.7±0.2
+Synthetic
    unsupervised      16.2±0.1    17.6±0.1    19.0±0.1
    supervised        16.7±0.1    -           -

Table 2. Translation quality (measured by BLEU) averaged over 3 MIRA runs.


5.1.  Translation 

We evaluate our approach in the standard discriminative MT framework. We use
cdec (Dyer et al., 2010) as our decoder and perform MIRA training (Chiang, 2012) to
learn feature weights. We compare the following configurations:

•  A baseline system, using a 4-gram language model trained on the entire
monolingual and bilingual data available.

•  An enriched system with a class-based n-gram language model^10 trained on the
monolingual data mapped to 600 Brown clusters. Class-based language
modeling is a strong baseline for scenarios with high out-of-vocabulary rates but in
which large amounts of monolingual target-language data are available.

•  The enriched system further augmented with our inflected synthetic phrases.
We expect the class-based language model to be especially helpful here and to
capture some basic agreement patterns that can be learned more easily on dense
clusters than from plain word sequences.

We evaluate translation quality by translating and measuring BLEU on a held-out
evaluation corpus, averaging the results over 3 MIRA runs (Table 2). For all languages,
using class language models improves over the baseline. When synthetic phrases are
added, significant additional improvements are obtained. For the English-Russian
language pair, where both supervised and unsupervised analyses can be obtained,
we notice that the expert-crafted morphological analyzer is more effective at improving
translation quality than the unsupervised one (16.7 vs. 16.2 BLEU).

6.  Morphogen  Implementation  Discussion  and  User's  Guide 

This  section  describes  the  open-source  Python  implementation  of  this  work,  mor¬ 
phogen.—  Our  decision  to  use  Python  means  the  code — from  feature  extraction  to 
grammar  processing — is  generally  readable  and  simple  to  modify  for  research  pur¬ 
poses.  For  example,  with  few  changes  to  the  code,  it  is  easy  to  expand  the  number  of 

10For  Swahili  and  Hebrew,  n  =  6;  for  Russian,  n  =  7. 

11https://github. com/eschling/morphogen 
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synthetic  phrases  created  by  generating  k-best  inflections  (rather  than  just  the  most 
probable  inflection),  or  to  restrict  the  phrases  created  based  on  some  source  side  cri¬ 
terion  such  as  type  frequency,  POS  type,  or  the  like. 

Since there are many processing steps that must be coordinated to run morphogen,
we provide reference workflows using ducttape^12 for both supervised and
unsupervised morphological analyses (discussed below). While these workflows are set up to
be used with cdec, morphogen generates grammars that could be used with any
decoder that supports per-sentence grammars. The source language processing, which
we do for English using TurboParser and TurboTagger, could be done with any tagger
and any parser that can produce basic Stanford dependencies. The source language
does not necessarily need to be English, although our approach depends on having
detailed source side contextual information.^13

We now review the steps that must be taken to run morphogen with either an
external (generally supervised) morphological analyzer or the unsupervised
morphological analyzer we described above. These steps are implemented in the provided
ducttape workflows.


Running morphogen with an external morphological analyzer. If a supervised
morphological analyzer is used, the parallel training data must be analyzed on the target
side, with each line containing four fields (source sentence, target sentence, target
stem sentence, target analysis sequence), where fields are separated with the triple
pipe (|||) symbol. Target language monolingual data must likewise be analyzed and
provided in a file where each line contains three fields (sentence, stem sentence,
analysis sequence) separated by triple pipes. For supervised morphological
analyses, the user must also provide a Python configuration file that contains a
function get_attributes,^14 which parses the string representing the target morphological
analysis into a set of features that will be exposed to the model as the target
morphological feature vector $\psi(\mu)$.
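
Before running the pipeline, it can be useful to check that an analyzed file really follows the four-field layout. The sketch below parses one such line; the function name is hypothetical, the requirement that target, stem, and analysis sequences have the same number of tokens is an assumption based on the example in Figure 3, and the placeholder tokens in the usage line are invented.

```python
def parse_supervised_line(line):
    """Split one analyzed training line into its four '|||'-separated fields."""
    fields = [field.strip() for field in line.rstrip("\n").split("|||")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}")
    source, target, stems, tags = (field.split() for field in fields)
    # Assumed invariant: one stem and one analysis per target token.
    if not len(target) == len(stems) == len(tags):
        raise ValueError("target, stem, and analysis sequences differ in length")
    return source, target, stems, tags

parsed = parse_supervised_line(
    "src1 src2 ||| tgt1 tgt2 ||| stem1 stem2 ||| TAG1 TAG2\n")
```

Running such a check over the whole corpus catches misaligned analyses early, before they surface as harder-to-diagnose errors during alignment or training.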


Running morphogen with the unsupervised morphological analyzer. To use
unsupervised morphological analysis, two steps are required beyond those needed
for an external analyzer:


12 ducttape is an open-source workflow management system similar to make, but designed for research
environments. It is available from https://github.com/jhclark/ducttape.

13 It is also unclear how effective our model would be when translating between two morphologically
rich languages, since we assume that the source language expresses syntactically many of the things which
the target language expresses with morphology. This is a topic for future research, and one that will be
facilitated by morphogen.

14 See the morphogen documentation for more information on defining this function. The configuration
for the Russian positional tagset used for the "supervised" Russian experiments is provided as an example.
Tokenized source:              We 've heard that empty promise before . |||
Tokenized target (inflected):  Но мы и раньше слышали эти пустые обещания . |||
Tokenized target (stemmed):    но мы и раньше слышать этот пустой обещание . |||
POS + inflectional features:   C P-1-pnn C R Vmis-p-a-e P---paa Afpmpaf Ncnpan .

Figure 3. Example supervised input. The four fields are shown on separate lines only
for ease of reading; there should be no newline character within a line of the input.


•  use fast_umorph^15 to get unsupervised morphological analyses (see §3.2);

•  use seg_tags.py with these segmentations to retrieve the lemmatized and
tagged version of the target text. Tags for unsupervised morphological
segmentations are a simple representation of the learned segmentation. Words shorter than
four characters are tagged with an X and subsequently ignored.


Remaining training steps. Once the training data has been morphologically
analyzed, the following steps are necessary:

•  process the source side of the parallel data using TurboTagger, TurboParser, and
Brown clusters.

•  use lex_align.py to extract parallel source and target stems with category
information. This lemmatized target side is used with cdec's fast_align to produce
alignments.

•  combine to get fully preprocessed parallel data, in the form (source sentence,
source POS sequence, source dependency tree, source class sequence, target
sentence, target stem sequence, target morphological tag sequence, word
alignment), separated by the triple pipe.

•  use rev_map.py to create a mapping from (stem, category) to sets of possible
inflected forms and their tags. Optionally, monolingual data can be added to
this mapping, to allow for the creation of inflected word forms that appear in
the monolingual data but not in the parallel training data. If a (stem, category)
pair maps to multiple inflections that have the same morphological analysis, the
most frequent form is used.^16

•  train structured inflection models with SGD using struct_train.py. A separate
inflection model must be created for each word category that is to be inflected.
There is only a single category when unsupervised segmentation is used.
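
The reverse inflection map built in the fourth step can be sketched as follows. This is an illustrative reimplementation of the idea, not the code of rev_map.py: the function name, the input representation as (form, stem, category, tag) tuples, and the English toy data are assumptions.

```python
from collections import Counter, defaultdict

def build_reverse_map(analyzed_tokens):
    """Map (stem, category) -> {tag: most frequent inflected form with that tag}.

    analyzed_tokens is an iterable of (form, stem, category, tag) tuples taken
    from the analyzed parallel (and optionally monolingual) data.
    """
    counts = defaultdict(lambda: defaultdict(Counter))
    for form, stem, category, tag in analyzed_tokens:
        counts[stem, category][tag][form] += 1
    # When several forms share one analysis, keep only the most frequent form.
    return {key: {tag: forms.most_common(1)[0][0]
                  for tag, forms in by_tag.items()}
            for key, by_tag in counts.items()}

toy_tokens = [
    ("walked", "walk", "V", "past"),
    ("walked", "walk", "V", "past"),
    ("walkt", "walk", "V", "past"),
    ("walks", "walk", "V", "pres3sg"),
]
reverse_map = build_reverse_map(toy_tokens)
```

At test time, the keys of the inner dictionary for a stem give its candidate inflections, and the chosen tag is looked up to obtain the surface form that goes into the synthetic rule.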


15 https://github.com/vchahun/fast_umorph

16 This is only possible when a supervised morphological analyzer is used, as our unsupervised tags are
just a representation of the segmentation (e.g., wa+ku+STEM).
Using morphogen for tuning and testing. At tuning and testing time, the following
steps are run:

•  extract two sets of per-sentence grammars, one with the original target side and
the other with the lemmatized target side.

•  use the extracted grammars, the trained inflection models, and the reverse
inflection map with synthetic_grammar.py to create an augmented grammar that
consists of both the original grammar rules and any inflected synthetic rules
(§4). By default, only the single best inflection is used to create a synthetic rule,
but this can be modified easily.

•  add a target language model and optionally a target class-based language model.
Proceed with decoding as normal (we tune with MIRA and then evaluate on
our test set).
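
Moving from the single best inflection to the k best, as mentioned in the second step, amounts to a small change in how candidates are selected. The sketch below shows one way to do it; it is not taken from synthetic_grammar.py, and the function names, the probability dictionary, and the toy reverse map are hypothetical.

```python
import heapq

def k_best_inflections(probs, k=1):
    """Return the k most probable (inflection, probability) pairs, best first."""
    return heapq.nlargest(k, probs.items(), key=lambda item: item[1])

def synthesize(stem_key, probs, reverse_map, k=1):
    """Yield (surface form, probability) for the k best inflections of one stem."""
    forms = reverse_map.get(stem_key, {})
    for tag, p in k_best_inflections(probs, k):
        if tag in forms:  # only attested forms can be produced
            yield forms[tag], p

reverse_map = {("walk", "V"): {"past": "walked", "pres3sg": "walks"}}
probs = {"past": 0.6, "pres3sg": 0.3, "fut": 0.1}
one_best = list(synthesize(("walk", "V"), probs, reverse_map, k=1))
two_best = list(synthesize(("walk", "V"), probs, reverse_map, k=2))
```

Larger values of k give the decoder more alternatives to choose from, at the cost of larger per-sentence grammars.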


Using the ducttape workflows. The provided ducttape workflows implement the
above pipelines, including downloading all of the necessary tool dependencies so as
to make the process as simple as possible. The user simply needs to replace the global
variables for the dev, test, and training sets with the correct information, point the
workflow at their version of morphogen, and decide which options they would like to use.
Sample workflow paths are already created (e.g., paths with/without monolingual training
data, with/without a class-based target language model). These can be modified as
needed.


Analysis tools. We also provide the scripts predict.py and showmodel.py. The
former is used to perform an intrinsic evaluation of the inflection model on held-out
development data. The latter provides a detailed view of the top features for various
inflections, allowing for manual inspection of the model as in Figure 2. An example
workflow script for the intrinsic evaluation is also provided.

7.  Conclusion 

We have presented an efficient technique which exploits morphologically analyzed
corpora to produce new inflections possibly unseen in the bilingual training data and
described a simple, open-source tool that implements it. Our method decomposes
into two simple independent steps involving well-understood discriminative models.

By relying on source-side context to generate additional local translation options
and by leaving the choice of the global sentence translation to the decoder, we sidestep
the issue of inflecting imperfect translations and we are able to exploit rich
annotations to select appropriate inflections without modifying the decoding process or even
requiring that a specific decoder or translation model type be used.
We also achieve language independence by exploiting unsupervised morphological
segmentations in the absence of linguistically informed morphological analyses,
making this tool appropriate for low-resource scenarios.